Nomad List finds you the best places in the world to live, work and play. Every second, it collects 1,000,000+ data points on 2,500+ cities around the world, from cost of living and temperature to safety. With that data, Nomad List gives you an idea of where it's best for you to live, work and travel. Hopefully my analysis will help you choose your next travel destination.
library(tidyverse)  # provides read_csv() and the dplyr verbs used throughout
library(knitr)      # provides kable(), used for the tables below
cities_orig <- read_csv('cities_predict.csv')
cities <- cities_orig %>%
  mutate(adult_nightlife = as.factor(adult_nightlife),
         cashless_society = as.numeric(cashless_society),
         cost_of_living = as.factor(cost_of_living),
         friendly_to_foreigners = as.factor(round(friendly_to_foreigners)),
         female_friendly = as.factor(round(female_friendly)),
         freedom_of_speech = as.factor(round(freedom_of_speech)),
         fun = as.factor(round(fun)),
         happiness = as.factor(round(happiness)),
         healthcare = as.factor(round(healthcare)),
         lgbt_friendly = as.factor(round(lgbt_friendly)),
         nightlife = as.factor(round(nightlife)),
         peace = as.factor(round(peace)),
         quality_of_life = as.factor(round(quality_of_life)),
         racial_tolerance = as.factor(round(racial_tolerance)),
         religious_government = as.factor(round(religious_government)),
         safe_tap_water = as.factor(round(safe_tap_water)),
         safety = as.factor(round(safety)),
         startup_score = as.factor(round(startup_score)),
         traffic_safety = as.factor(round(traffic_safety)),
         walkability = as.factor(round(walkability)))
When I plan to travel to a foreign city, there are two main things I am concerned about. The first is whether the city is safe, and the second is approximately how much it will cost to stay there, such as hotel or Airbnb expenses. If the nomad score is good at selecting a travel city for me, the score should be higher as both of these improve.
ggplot(cities, aes(x = `airbnb_(monthly)`, y = nomad_score, color = safety)) +
  geom_smooth(se = FALSE)
However, the graph shows a weird fluctuation of the nomad score when the monthly Airbnb charges are in the lower range, and safety level 3 has, most of the time, a higher nomad score than safety level 4. So I want to dig deeper to find out what exactly is the driving force behind the score system.
cities <- cities %>%
select(-cost_of_living_for_local,-region,-country,-place_slug)
library(psych)
cities_factor <- cities %>%
keep(is.factor)
cities_factor <- cities_factor %>%
mutate_all(as.numeric)
psych::fa.parallel(cities_factor)
## Parallel analysis suggests that the number of factors = 5 and the number of components = 2
faTest <- fa(cities_factor,nfactor = 5,rotate ='promax')
kable(faTest$weights)
| | MR1 | MR2 | MR5 | MR3 | MR4 |
|---|---|---|---|---|---|
| adult_nightlife | -0.0117069 | 0.0341672 | 0.0375709 | 0.0867500 | -0.0073499 |
| cost_of_living | -0.0626256 | 0.0726184 | -0.0981808 | -0.0744282 | 0.0171960 |
| female_friendly | 0.1218278 | 0.4720888 | -0.0355599 | 0.0076938 | -0.0591136 |
| freedom_of_speech | 0.0937258 | -0.0468084 | 0.0702228 | -0.0403729 | 0.0036051 |
| friendly_to_foreigners | -0.0054714 | 0.1563487 | 0.0497224 | 0.1075312 | 0.0407458 |
| fun | 0.0061894 | 0.0501262 | 0.0109154 | 0.4675006 | 0.0469023 |
| happiness | -0.0275099 | -0.0266599 | 0.4606967 | 0.0059587 | 0.0433313 |
| healthcare | 0.0967344 | -0.1672075 | -0.0117247 | 0.0820218 | 0.0995685 |
| lgbt_friendly | -0.0240348 | 0.1935276 | 0.2701901 | 0.1624368 | -0.0570423 |
| nightlife | -0.0037987 | 0.0260730 | 0.0486145 | 0.2917165 | 0.0092182 |
| peace | 0.3269393 | 0.0083394 | 0.0818914 | -0.1437692 | -0.0550850 |
| quality_of_life | 0.0190167 | 0.0145873 | 0.0002829 | 0.0431559 | 0.4587251 |
| racial_tolerance | -0.0582284 | 0.1371041 | 0.2218552 | 0.0868623 | 0.0003060 |
| religious_government | 0.0065301 | -0.0851799 | 0.0211889 | -0.0079261 | -0.0254727 |
| safe_tap_water | 0.2965135 | -0.0371447 | -0.0080646 | -0.0486168 | 0.0845021 |
| safety | 0.1064824 | -0.0011249 | -0.1650407 | -0.0092971 | -0.0587193 |
| startup_score | 0.0369492 | 0.1704651 | -0.0288084 | -0.0062270 | 0.4010399 |
| traffic_safety | 0.0877015 | 0.0448513 | -0.1003895 | -0.0160135 | 0.1181129 |
| walkability | -0.0181779 | 0.1210911 | 0.0167551 | 0.0311898 | 0.0052967 |
cities_num <- cities %>%
keep(is.numeric)
cities_fac <- faTest$scores
colnames(cities_fac) <- c('life_quality','diverse_society','happiness_minorities','entertaining_nights','career_development')
cities_tot <- cbind(cities_num,cities_fac)
Because the majority of my variables are abstract categorical variables describing features of the cities, such as happiness and safety, I first used parallel analysis to find the optimal number of factors, then used factor analysis to convert the nominal variables into numeric ones. Based on the new variables' correlations with the old ones, I assigned five labels that align with the old variables' meanings.
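To see which original variables drive each factor, and hence justify the five labels above, the loadings can be printed with a cutoff that hides small values. A minimal sketch, assuming `faTest` from the `fa()` call above:

```r
# Show only loadings above 0.3, sorted so each factor's strongest
# variables appear together; this is what the labels were based on.
print(faTest$loadings, cutoff = 0.3, sort = TRUE)
```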
cities_cor <- cities_tot %>%
select(-nomad_score)
cormat <- round(cor(cities_cor),2)
library(reshape2)
melted_cormat <- melt(cormat)
get_lower_tri<-function(cormat){
cormat[upper.tri(cormat)] <- NA
return(cormat)
}
# Get upper triangle of the correlation matrix
get_upper_tri <- function(cormat){
cormat[lower.tri(cormat)]<- NA
return(cormat)
}
upper_tri <- get_upper_tri(cormat)
melted_cormat <- melt(upper_tri, na.rm = TRUE)
# Heatmap
library(ggplot2)
ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "#BA7940", high = "#296375", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1))+
theme(panel.grid.major = element_blank(),panel.grid.minor = element_blank())+
coord_fixed()+
geom_text(aes(Var2, Var1, label = value), color = "black", size = 4)
Although I have turned all of the factor variables into numeric ones, there are still too many variables, and many of them may be correlated with each other. I generated a correlation heat map to identify and drop the variables with correlations above 0.5.
cities_tot1 <-cities_cor %>%
select(-`coca-cola`,-cost_of_living_for_expat,-happiness_minorities,-`air_quality_(year-round)`,-life_quality,-`airbnb_(monthly)`)
cities_tot1 <- cbind(cities_tot1,cities$nomad_score)
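As a cross-check of the manual heat-map inspection, `caret::findCorrelation()` can flag columns whose pairwise correlations exceed the 0.5 cutoff automatically. A sketch, assuming `cities_cor` from above:

```r
library(caret)
# findCorrelation returns the columns it suggests removing so that
# no remaining pair has |correlation| above the cutoff.
high_cor <- findCorrelation(cor(cities_cor), cutoff = 0.5, names = TRUE)
high_cor  # candidate columns to drop before modeling
```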
library(penalized)
netTest <- penalized(response = `cities$nomad_score`, penalized = ~ coffee + internet + diverse_society + entertaining_nights + career_development + cashless_society + `1br_studio_rent_in_center`,
data = cities_tot1, lambda1 = 20,
model = "linear", trace = FALSE)
coefficients(netTest)
(Intercept) internet
3.806289e+00 3.635099e-03
diverse_society entertaining_nights
1.337256e-01 2.712011e-02
career_development `1br_studio_rent_in_center`
1.081393e-01 3.462062e-05
After getting rid of the multicollinearity, I was still left with 9 variables, so I implemented a lasso feature-selection method to further reduce the variables to 5.
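The `lambda1 = 20` penalty above was picked by hand; an alternative is to let cross-validation choose the penalty, for example with `glmnet`. A sketch under that assumption, using the same predictors as in the `penalized` call:

```r
library(glmnet)
# Cross-validated lasso: cv.glmnet picks the penalty minimizing
# out-of-fold error instead of a hand-chosen lambda.
x <- as.matrix(cities_tot1[, c("coffee", "internet", "diverse_society",
                               "entertaining_nights", "career_development",
                               "cashless_society", "1br_studio_rent_in_center")])
y <- cities_tot1$`cities$nomad_score`
cv_fit <- cv.glmnet(x, y, alpha = 1)   # alpha = 1 -> lasso
coef(cv_fit, s = "lambda.min")         # coefficients at the best lambda
```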
mod1 <- lm(`cities$nomad_score` ~ ., data = cities_tot1)
m1 <- step(mod1,direction='backward',trace=FALSE)
I also used a backward stepwise function to cross-check that I was dropping the correct variables, and found internet, diverse_society, entertaining_nights, and career_development to be the most important features in determining the nomad score. After comparing these two feature-selection methods, I decided to drop the 1br studio rent variable because it overlaps with career development and its effect seems very small.
cities_tot2 <- cities_tot1 %>%
mutate(internet = internet-mean(internet),
diverse_society = diverse_society-mean(diverse_society),
entertaining_nights=entertaining_nights-mean(entertaining_nights),
career_development=career_development-mean(career_development)) %>%
dplyr::select(-`1br_studio_rent_in_center`,-cashless_society,-coffee)
colnames(cities_tot2) <- c('internet','diverse_society','entertaining_nights','career_development','nomad_score')
I want to make my intercept more meaningful, and I checked that except for internet speed, none of the other three variables make sense at a value of zero. So I centered the predictors, which makes the intercept the predicted score for an average city.
H0: Internet speed has no impact on the nomad scores of the cities.
H1: Internet speed has an impact on the nomad scores of the cities.
H0: A diverse society has no impact on the nomad scores of the cities.
H1: A diverse society has an impact on the nomad scores of the cities.
H0: Entertaining nightlife has no impact on the nomad scores of the cities.
H1: Entertaining nightlife has an impact on the nomad scores of the cities.
H0: Career development has no impact on the nomad scores of the cities.
H1: Career development has an impact on the nomad scores of the cities.
mod2 <- lm(nomad_score ~ internet + diverse_society +
entertaining_nights + career_development, data = cities_tot2)
summary(mod2)
Call:
lm(formula = nomad_score ~ internet + diverse_society + entertaining_nights +
career_development, data = cities_tot2)
Residuals:
Min 1Q Median 3Q Max
-1.23548 -0.16330 0.02788 0.20690 1.01953
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.9065608 0.0121499 321.529 < 2e-16 ***
internet 0.0028061 0.0007529 3.727 0.000209 ***
diverse_society 0.1520363 0.0191510 7.939 7.77e-15 ***
entertaining_nights 0.0508581 0.0187000 2.720 0.006691 **
career_development 0.1371612 0.0179893 7.625 7.69e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3283 on 725 degrees of freedom
Multiple R-squared: 0.428, Adjusted R-squared: 0.4248
F-statistic: 135.6 on 4 and 725 DF, p-value: < 2.2e-16
From the linear model output we can see that all four variables are significant, with p-values less than 0.05. Holding the other variables constant, a one-unit change in internet speed, diverse_society, entertaining_nights, and career_development increases the nomad score by about 0.0028, 0.15, 0.05, and 0.14, respectively. And a city with average internet speed, diverse_society, entertaining_nights, and career_development has a predicted nomad score of about 3.9.
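As a sanity check on this interpretation, the fitted model can predict scores for hypothetical cities; since the predictors are mean-centered, zeros represent an average city. A sketch, assuming `mod2` from above:

```r
# Predicted score for an average city (all centered predictors = 0)
# versus a city one unit above average on every predictor.
newdata <- data.frame(internet = c(0, 1),
                      diverse_society = c(0, 1),
                      entertaining_nights = c(0, 1),
                      career_development = c(0, 1))
predict(mod2, newdata)
# first value is the intercept (~3.91); the second adds the four slopes
```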
par(mfrow = c(2, 2))
plot(mod2)
library(MASS)
robTest <- rlm(nomad_score ~ internet + diverse_society +
entertaining_nights + career_development, data = cities_tot2, psi = psi.bisquare)
summary(robTest)
Call: rlm(formula = nomad_score ~ internet + diverse_society + entertaining_nights +
career_development, data = cities_tot2, psi = psi.bisquare)
Residuals:
Min 1Q Median 3Q Max
-1.275056 -0.193962 0.004057 0.186020 0.990478
Coefficients:
Value Std. Error t value
(Intercept) 3.9318 0.0112 349.6585
internet 0.0024 0.0007 3.3737
diverse_society 0.1457 0.0177 8.2187
entertaining_nights 0.0602 0.0173 3.4800
career_development 0.1441 0.0166 8.6525
Residual standard error: 0.2781 on 725 degrees of freedom
comparisonData = data.frame(weights = robTest$w,
residuals = robTest$residuals)
kable(head(comparisonData, 10))
| weights | residuals |
|---|---|
| 0.9960797 | -0.0577488 |
| 0.8249809 | 0.3945969 |
| 0.3570616 | 0.8266246 |
| 0.3207169 | 0.8581104 |
| 0.9784778 | -0.1355662 |
| 0.9658876 | -0.1709297 |
| 0.6713078 | 0.5538481 |
| 0.6494323 | 0.5741050 |
| 0.9584990 | 0.1886721 |
| 0.6910880 | -0.5352109 |
kable(head(comparisonData[order(comparisonData$weights), ], 10))
| | weights | residuals |
|---|---|---|
| 281 | 0.0018131 | -1.2750557 |
| 379 | 0.0262037 | -1.1929651 |
| 74 | 0.0942716 | -1.0847585 |
| 92 | 0.1084575 | -1.0671821 |
| 289 | 0.1095786 | -1.0658201 |
| 173 | 0.1116113 | -1.0633865 |
| 80 | 0.1190607 | -1.0545679 |
| 498 | 0.1261058 | -1.0464453 |
| 343 | 0.1322246 | -1.0395185 |
| 46 | 0.1782617 | 0.9904778 |
In the Residuals vs Fitted plot I generated, the dots seem pretty random, so a heteroscedasticity problem is unlikely. However, in the Residuals vs Leverage plot we can see dashed lines (Cook's distance contours), and several points come really close to them, which means we likely have an outlier problem. So I used a robust linear regression to down-weight my outliers and get a better linear model.
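To see how much down-weighting the outliers actually moved the fit, the OLS and robust coefficients can be placed side by side. A sketch, assuming `mod2` and `robTest` from above:

```r
# Compare OLS and robust (bisquare) estimates; large differences
# would indicate that outliers were distorting the OLS fit.
round(cbind(OLS = coef(mod2), Robust = coef(robTest)), 4)
```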
fsquare = summary(mod2)$adj.r.squared/(1-summary(mod2)$adj.r.squared)
library(pwr)
pwr.f2.test(u = 4, v = 730, f2 = fsquare, power =NULL)
Multiple regression power calculation
u = 4
v = 730
f2 = 0.7386699
sig.level = 0.05
power = 1
By running a post hoc power analysis using f2 and a sample size of 730, I got a power of 1, which means my model has sufficient power.
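The effect size used here is Cohen's f², computed from the adjusted R²; the arithmetic from the model summary can be checked directly:

```r
# Cohen's f2 = R2 / (1 - R2), using the rounded adjusted R2 = 0.4248
0.4248 / (1 - 0.4248)
# ~0.739, matching the f2 = 0.7387 above (which uses the unrounded R2)
```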
cities_logreg <- cities_tot2 %>%
mutate(outcome = ifelse(nomad_score > 4, 1,0)) %>%
mutate(outcome = as.factor(outcome)) %>%
dplyr::select(-nomad_score)
library(caret)
set.seed(1234)
sample.set <- createDataPartition(cities_logreg$outcome, p = 0.75, list = FALSE)
cities_train <- cities_logreg[sample.set, ]
cities_test <- cities_logreg[-sample.set, ]
logit_mod <-
glm(outcome ~ ., family = binomial(link = 'logit'), data = cities_train)
#' View the results of the model.
library(InformationValue)
summary(logit_mod)
Call:
glm(formula = outcome ~ ., family = binomial(link = "logit"),
data = cities_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1825 -0.8194 -0.2407 0.8392 2.2847
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.179507 0.107082 -1.676 0.09367 .
internet 0.008964 0.007080 1.266 0.20546
diverse_society 1.044359 0.175622 5.947 2.74e-09 ***
entertaining_nights 0.363079 0.157185 2.310 0.02089 *
career_development 0.591325 0.198343 2.981 0.00287 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 758.46 on 547 degrees of freedom
Residual deviance: 556.63 on 543 degrees of freedom
AIC: 566.63
Number of Fisher Scoring iterations: 5
logit_pred <- predict(logit_mod, cities_test, type = 'response')
ideal_cutoff <-
optimalCutoff(
actuals = cities_test$outcome,
predictedScores = logit_pred,
optimiseFor = "Both"
)
# Transform predictions based on the ideal cutoff.
logit_pred <- ifelse(logit_pred > ideal_cutoff, 1, 0)
#' What does our confusion matrix now look like?
logit_pred_table <- table(cities_test$outcome, logit_pred)
logit_pred_table
logit_pred
0 1
0 74 21
1 20 67
#' Now, what is our accuracy?
sum(diag(logit_pred_table)) / nrow(cities_test)
[1] 0.7747253
Since the linear model gave an average city score of about 3.9, I used 4 as the cutoff: if a city's nomad score is above 4 we recommend it and I assigned the value 1, and if the score is below 4 I assigned the value 0. I ran a logistic regression to see the predictive value of my model. In the logistic model, internet is no longer significant, and I get an accuracy of 77%, which is not bad. So the driving forces behind the nomad score are the four variables: internet, diverse_society, entertaining_nights, and career_development.
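Accuracy alone can hide class-specific errors; sensitivity and specificity can be read off the same confusion matrix. This small check uses the counts printed above:

```r
# Counts from the confusion matrix: rows = actual, columns = predicted
tn <- 74; fp <- 21; fn <- 20; tp <- 67
accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # ~0.775, as reported
sensitivity <- tp / (tp + fn)                   # ~0.770, recall on "recommend"
specificity <- tn / (tn + fp)                   # ~0.779
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

The two rates are nearly balanced, so the 4.0 cutoff treats recommended and non-recommended cities about equally well.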
library(leaflet)
library(leafpop)
# cities_ideal: the subset of cities filtered to a nomad score above 4.5
mybins <- seq(4.5, 5, by = 0.1)
mypalette <- colorBin(palette = "YlGnBu", domain = cities_ideal$nomad_score, na.color = "transparent", bins = mybins)
# Prepare the text for the tooltip:
mytext <- paste(
"City: ", cities_ideal$place_slug, "<br/>",
"Nomad Score: ",cities_ideal$nomad_score, "<br/>",sep="") %>%
lapply(htmltools::HTML)
cities_ideal$remotePath <- paste0("https://github.com/eeeeevahe/pictures/raw/master/",cities_ideal$cityname,".jpg")
citimap1 <- leaflet(cities_ideal) %>%
addTiles() %>%
# add markers
setView(lat=27, lng=30,zoom=2) %>%
addCircleMarkers(lng=cities_ideal$long, lat=cities_ideal$lat,popup=paste(popupImage(cities_ideal$remotePath, src = c("remote"))),
fillColor = ~mypalette(cities_ideal$nomad_score), fillOpacity = 0.7, color="white", radius=8, stroke=FALSE,
label = mytext,
labelOptions = labelOptions( style = list("font-weight" = "normal", padding = "3px 8px"), textsize = "13px", direction = "auto")
) %>%
addLegend( pal=mypalette, values=cities_ideal$nomad_score, opacity=0.9, title = "NomadScore", position = "bottomright" )
citimap1
I filtered for scores above 4.5 to find some ideal places to go. This rating system may not work for me, because it doesn't put much weight on safety and the other things I am concerned about, but my analysis still works for people whose priorities match these four factors. So if you want to go to Southeast Asia, Chiang Mai in Thailand, with a score of 4.97, is a great place to go. If exploring Europe is your goal, then Budapest in Hungary, with a score of 4.93, is a great choice.
A work by Eva He
yhe24@nd.edu